131 research outputs found

    Generation-Based Data Augmentation for Offensive Language Detection: Is It Worth It?

    Generation-based data augmentation (DA) has been presented in several works as a way to improve offensive language detection. However, the effectiveness of generative DA has been shown only in limited scenarios, and the potential injection of biases when using generated data to classify offensive language has not been investigated. We aim to analyze the feasibility of generative data augmentation in more depth, with two main focuses. First, we investigate the robustness of models trained on generated data in a variety of data augmentation setups, both novel and already presented in previous work, and compare their performance on four widely used English offensive language datasets that present inherent differences in terms of content and complexity. In addition, we analyze models using the HateCheck suite, a series of functional tests created to challenge hate speech detection systems. Second, we investigate potential lexical bias issues through a qualitative analysis of the generated data. We find that the potential positive impact of generative data augmentation on model performance is unreliable, and that generative DA can also have unpredictable effects on lexical bias.

    FBK-DH at SemEval-2020 Task 12: Using Multi-channel BERT for Multilingual Offensive Language Detection

    In this paper we present our submission to sub-task A at SemEval 2020 Task 12: Multilingual Offensive Language Identification in Social Media (OffensEval2). For Danish, Turkish, Arabic and Greek, we develop an architecture based on transfer learning and relying on a two-channel BERT model, in which the English BERT and the multilingual one are combined after creating a machine-translated parallel corpus for each language in the task. For English, instead, we adopt a more standard, single-channel approach. We find that, in a multilingual scenario where some languages have little training data, using parallel BERT models with machine-translated data can give systems more stability, especially when dealing with noisy data. The fact that machine translation of social media data may not be perfect does not hurt the overall classification performance.

    Spectrum of mutations in Italian patients with familial hypercholesterolemia: New results from the LIPIGEN study

    Background: Familial hypercholesterolemia (FH) is an autosomal dominant disease characterized by elevated plasma levels of LDL-cholesterol that confers an increased risk of premature atherosclerotic cardiovascular disease. Early identification and treatment of FH patients can improve prognosis and reduce the burden of cardiovascular mortality. The aim of this study was to perform the mutational analysis of FH patients identified through a collaboration of 20 Lipid Clinics in Italy (LIPIGEN Study). Methods: We recruited 1592 individuals with a clinical diagnosis of definite or probable FH according to the Dutch Lipid Clinic Network criteria. We performed parallel sequencing of the major candidate genes for monogenic hypercholesterolemia (LDLR, APOB, PCSK9, APOE, LDLRAP1, STAP1). Results: A total of 213 variants were detected in 1076 subjects. About 90% of them had a pathogenic or likely pathogenic variant. More than 94% of patients carried pathogenic variants in the LDLR gene, 27 of which were novel. Pathogenic variants in APOB and PCSK9 were exceedingly rare. We found 4 true homozygotes and 5 putative compound heterozygotes for pathogenic variants in the LDLR gene, as well as 5 double heterozygotes for LDLR/APOB pathogenic variants. Two patients were homozygous for pathogenic variants in the LDLRAP1 gene, resulting in autosomal recessive hypercholesterolemia. One patient was found to be heterozygous for the ApoE variant p.(Leu167del), known to confer an FH phenotype. Conclusions: This study shows the molecular characteristics of the FH patients identified in Italy over the last two years. Full phenotypic characterization of these patients and cascade screening of family members are now in progress.

    Clinical features and outcomes of elderly hospitalised patients with chronic obstructive pulmonary disease, heart failure or both

    Background and objective: Chronic obstructive pulmonary disease (COPD) and heart failure (HF) each increase the risk of the other being present in the same patient, especially in older patients. Whether this coexistence is associated with a worse prognosis is debated. Therefore, employing data derived from the REPOSI register, we evaluated the clinical features and outcomes in a population of elderly patients admitted to internal medicine wards and having COPD, HF or COPD + HF. Methods: We measured socio-demographic and anthropometric characteristics, severity and prevalence of comorbidities, clinical and laboratory features during hospitalization, mood disorders, functional independence, drug prescriptions and discharge destination. The primary study outcome was the risk of death. Results: We considered 2,343 elderly hospitalized patients (median age 81 years), of whom 1,154 (49%) had COPD, 813 (35%) HF, and 376 (16%) COPD + HF. Patients with COPD + HF had different characteristics than those with COPD or HF, such as a higher prevalence of previous hospitalizations, comorbidities (especially chronic kidney disease), higher respiratory rate at admission and number of prescribed drugs. Patients with COPD + HF (hazard ratio [HR] 1.74, 95% confidence interval [CI] 1.16-2.61) and patients with dementia (HR 1.75, 95% CI 1.06-2.90) had a higher risk of death at one year. The Kaplan-Meier curves showed a higher mortality risk in the group of patients with COPD + HF for all causes (p = 0.010), respiratory causes (p = 0.006), cardiovascular causes (p = 0.046) and respiratory plus cardiovascular causes (p = 0.009). Conclusion: In this real-life cohort of hospitalized elderly patients, the coexistence of COPD and HF significantly worsened prognosis at one year. This finding may help to better define the care needs of this population.

    Familial hypercholesterolemia: The Italian Atherosclerosis Society Network (LIPIGEN)

    BACKGROUND AND AIMS: Primary dyslipidemias are a heterogeneous group of disorders characterized by abnormal levels of circulating lipoproteins. Among them, familial hypercholesterolemia is the most common lipid disorder that predisposes to premature cardiovascular disease. We set up an Italian nationwide network aimed at facilitating the clinical and genetic diagnosis of genetic dyslipidemias named LIPIGEN (LIpid TransPort Disorders Italian GEnetic Network). METHODS: Observational, multicenter, retrospective and prospective study involving about 40 Italian clinical centers. Genetic testing of the appropriate candidate genes at one of six molecular diagnostic laboratories serving as nationwide DNA diagnostic centers. RESULTS AND CONCLUSIONS: From 2012 to October 2016, available biochemical and clinical information of 3480 subjects with familial hypercholesterolemia identified according to the Dutch Lipid Clinic Network (DLCN) score were included in the database and genetic analysis was performed in 97.8% of subjects, with a mutation detection rate of 92.0% in patients with DLCN score ≥6. The establishment of the LIPIGEN network will have important effects on clinical management and it will improve the overall identification and treatment of primary dyslipidemias in Italy.

    Transfer Learning for Multilingual Offensive Language Detection with BERT

    The popularity of social media platforms has led to an increase in user-generated content being posted on the Internet. Users, masked behind what they perceive as anonymity, can express offensive and hateful thoughts on these platforms, creating a need to detect and filter abusive content. Since the amount of data available on the Internet is impossible to analyze manually, automatic tools are the most effective choice for detecting offensive and abusive messages. Academic research on the detection of offensive language on social media has been on the rise in recent years, with more and more shared tasks being organized on the topic. State-of-the-art deep-learning models such as BERT have achieved promising results on offensive language detection in English. However, multilingual offensive language detection systems, which focus on several languages at once, have remained underexplored until recently. In this thesis, we investigate whether transfer learning can be useful for improving the performance of a classifier for detecting offensive speech in Danish, Greek, Arabic, Turkish, German, and Italian. More specifically, we first experiment with using machine-translated data as input to a classifier. This allows us to evaluate whether machine translated data can help classification. We then experiment with fine-tuning multiple pre-trained BERT models at once. This parallel fine-tuning process, named multi-channel BERT (Sohn and Lee, 2019), allows us to exploit cross-lingual information with the goal of understanding its impact on the detection of offensive language. Both the use of machine translated data and the exploitation of cross-lingual information could help the task of detecting offensive language in cases in which there is little or no annotated data available, for example for low-resource languages. 
    We find that using machine-translated data, either exclusively or mixed with gold data, to train a classifier on the task can often improve its performance. Furthermore, we find that fine-tuning multiple BERT models in parallel can positively impact classification, although it can lead to robustness issues for some languages.

    DiatopIt: A Corpus of Social Media Posts for the Study of Diatopic Language Variation in Italy

    We introduce DiatopIt, the first corpus specifically focused on diatopic language variation in Italy for language varieties other than Standard Italian. DiatopIt comprises over 15K geolocated social media posts from Twitter over a period of two years, including regional Italian usage and content fully written in local language varieties or exhibiting code-switching with Standard Italian. We detail how we tackled key challenges in creating such a resource, including the absence of orthography standards for most local language varieties and the lack of reliable language identification tools. We assess the representativeness of DiatopIt across time and space, and show that the density of non-Standard Italian content across areas correlates with actual language use. Finally, we conduct computational experiments and find that modeling diatopic variation in highly multilingual areas such as Italy is a complex task even for recent language models.

    Hate Speech Detection with Machine-Translated Data: The Role of Annotation Scheme, Class Imbalance and Undersampling

    While using machine-translated data for supervised training can alleviate data sparseness problems when dealing with less-resourced languages, it is important that the source data are not only correctly translated, but also follow the same annotation scheme and possibly the same class balance as the smaller dataset in the target language. We therefore present an evaluation of hate speech detection in Italian using machine-translated data from English and comparing three settings, in order to understand the impact of training size, class distribution and annotation scheme.